Audio-Visual Speaker Verification via Joint Cross-Attention
Speaker verification has been widely explored using speech signals, and has
shown significant improvement with deep models. Recently, there has been a
surge in exploring faces and voices as they can offer more complementary and
comprehensive information than relying only on a single modality of speech
signals. Though current methods in the literature on the fusion of faces and
voices have shown improvement over that of individual face or voice modalities,
the potential of audio-visual fusion is not fully explored for speaker
verification. Most of the existing methods based on audio-visual fusion either
rely on score-level fusion or simple feature concatenation. In this work, we
have explored cross-modal joint attention to fully leverage the inter-modal
complementary information and the intra-modal information for speaker
verification. Specifically, we estimate the cross-attention weights based on
the correlation between the joint feature representation and the individual
feature representations, in order to effectively capture both intra-modal and
inter-modal relationships among the faces and voices. We
have shown that efficiently leveraging the intra- and inter-modal relationships
significantly improves the performance of audio-visual fusion for speaker
verification. The performance of the proposed approach has been evaluated on
the VoxCeleb1 dataset. Results show that the proposed approach significantly
outperforms state-of-the-art audio-visual fusion methods for speaker
verification.
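The joint cross-attention mechanism described in the abstract above can be sketched as follows. This is a minimal, illustrative NumPy version: the shapes, the concatenation-based joint representation, and the random stand-in projection matrices are assumptions for exposition, not the authors' exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical shapes: T time steps, d-dimensional features per modality.
T, d = 10, 32
X_a = rng.standard_normal((T, d))   # audio (voice) features
X_v = rng.standard_normal((T, d))   # visual (face) features

# Joint representation: here, simple concatenation of the two modalities.
J = np.concatenate([X_a, X_v], axis=1)            # (T, 2d)

# Learnable projections (random stand-ins in this sketch).
W_a = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)
W_v = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)

# Correlation between the joint representation and each individual modality.
C_a = np.tanh((J @ W_a) @ X_a.T / np.sqrt(d))     # (T, T)
C_v = np.tanh((J @ W_v) @ X_v.T / np.sqrt(d))

# Cross-attention weights re-weight each modality's features, so each
# modality is attended conditioned on the joint (intra + inter) information.
att_a = softmax(C_a, axis=1) @ X_a                # (T, d)
att_v = softmax(C_v, axis=1) @ X_v

# The attended features are fused into the final representation.
fused = np.concatenate([att_a, att_v], axis=1)    # (T, 2d)
```

Because the correlation is computed against the joint representation rather than against the other modality alone, the attended features reflect both intra-modal and inter-modal relationships, which is the key difference from vanilla cross-attention.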
Audio-Visual Fusion for Emotion Recognition in the Valence-Arousal Space Using Joint Cross-Attention
Automatic emotion recognition (ER) has recently gained a lot of interest due to
its potential in many real-world applications. In this context, multimodal
approaches have been shown to improve performance (over unimodal approaches) by
combining diverse and complementary sources of information, providing some
robustness to noisy and missing modalities. In this paper, we focus on
dimensional ER based on the fusion of facial and vocal modalities extracted
from videos, where complementary audio-visual (A-V) relationships are explored
to predict an individual's emotional states in valence-arousal space. Most
state-of-the-art fusion techniques rely on recurrent networks or conventional
attention mechanisms that do not effectively leverage the complementary nature
of A-V modalities. To address this problem, we introduce a joint
cross-attentional model for A-V fusion that extracts the salient features
across the A-V modalities, allowing it to effectively leverage the inter-modal
relationships while retaining the intra-modal relationships. In particular, it
computes the cross-attention weights based on correlation between the joint
feature representation and that of the individual modalities. By deploying the
joint A-V feature representation into the cross-attention module, it helps to
simultaneously leverage both the intra- and inter-modal relationships, thereby
significantly improving the performance of the system over the vanilla
cross-attention module. The effectiveness of our proposed approach is validated
experimentally on challenging videos from the RECOLA and AffWild2 datasets.
Results indicate that our joint cross-attentional A-V fusion model provides a
cost-effective solution that can outperform state-of-the-art approaches, even
when the modalities are noisy or absent.
Holistic Guidance for Occluded Person Re-Identification
In real-world video surveillance applications, person re-identification
(ReID) suffers from the effects of occlusions and detection errors. Despite
recent advances, occlusions continue to corrupt the features extracted by
state-of-the-art CNN backbones, and thereby deteriorate the accuracy of ReID
systems. To address this issue, methods in the literature use an additional
costly process such as pose estimation, where pose maps provide supervision to
exclude occluded regions. In contrast, we introduce a novel Holistic Guidance
(HG) method that relies only on person identity labels, and on the distribution
of pairwise matching distances of datasets to alleviate the problem of
occlusion, without requiring additional supervision. Hence, our proposed
student-teacher framework is trained to address the occlusion problem by
matching the distributions of between- and within-class distances (DCDs) of
occluded samples with that of holistic (non-occluded) samples, thereby using
the latter as a soft labeled reference to learn well separated DCDs. This
approach is supported by our empirical study, where the distributions of
between- and within-class distances between images have more overlap in
occluded datasets than in holistic ones. In particular, features extracted from both datasets are
jointly learned using the student model to produce an attention map that allows
separating visible regions from occluded ones. In addition to this, a joint
generative-discriminative backbone is trained with a denoising autoencoder,
allowing the system to self-recover from occlusions. Extensive experiments on
several challenging public datasets indicate that the proposed approach can
outperform state-of-the-art methods on both occluded and holistic datasets.
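The core idea above — matching the distributions of between- and within-class distances (DCDs) of occluded samples to those of holistic samples — can be sketched with toy embeddings. The moment-matching loss below is a simplified stand-in for the distribution-matching objective; the embedding dimensions and class structure are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def pairwise_dists(X):
    # Euclidean distances between all rows of X.
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

def class_distance_dists(X, y):
    """Split pairwise distances into within-class and between-class sets."""
    D = pairwise_dists(X)
    same = (y[:, None] == y[None, :])
    iu = np.triu_indices(len(y), k=1)        # each unordered pair once
    within = D[iu][same[iu]]
    between = D[iu][~same[iu]]
    return within, between

# Toy embeddings: holistic (teacher) vs. occluded (student), 2 identities.
y = np.array([0] * 5 + [1] * 5)
holistic = rng.standard_normal((10, 16)) + y[:, None] * 3.0  # well separated
occluded = rng.standard_normal((10, 16)) + y[:, None] * 0.5  # more overlap

w_h, b_h = class_distance_dists(holistic, y)
w_o, b_o = class_distance_dists(occluded, y)

# Moment-matching surrogate: pull the occluded DCDs toward the holistic
# (soft-labeled reference) DCDs, encouraging well-separated distances.
loss = (w_o.mean() - w_h.mean()) ** 2 + (b_o.mean() - b_h.mean()) ** 2
```

In the holistic set, between-class distances sit well above within-class distances; driving the occluded student's DCDs toward that reference is what gives the student well-separated distance distributions without extra supervision such as pose maps.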
Crowd Flow Segmentation based on Motion Vectors in H.264 Compressed Domain
In this work, we have explored the prospect of segmenting crowd flow in H.264 compressed videos using only motion vectors. The motion vectors are extracted by partially decoding the corresponding video sequence in the H.264 compressed domain. The region of interest, i.e., the crowd-flow region, is extracted, the motion vectors that span it are preprocessed, and a collective representation of the motion vectors for the entire video is obtained. The motion vectors for the corresponding video are then clustered using the EM algorithm. Finally, the clusters that converge to a single flow are merged based on the Bhattacharyya distance between the histograms of the orientations of the motion vectors at the boundaries of the clusters. We implemented our proposed approach on the complex crowd-flow dataset provided by [1] and compared our results using the Jaccard measure. Since we perform crowd-flow segmentation in the compressed domain using only motion vectors, our proposed approach is much faster than its pixel-domain counterparts while retaining better accuracy.
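The pipeline above (EM clustering of motion vectors, then merging clusters with similar orientation histograms via the Bhattacharyya distance) can be sketched on synthetic data. This is an illustrative simplification: the synthetic motion vectors stand in for real H.264 macroblock vectors, `GaussianMixture` (fit with EM) stands in for the EM clustering, and the orientation histograms are computed per whole cluster rather than only at cluster boundaries as in the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)

# Toy stand-in for motion vectors from a partially decoded stream:
# two crowd flows, moving right and up (block position + (dx, dy)).
flow_r = np.c_[rng.uniform(0, 50, (200, 2)), rng.normal([4, 0], 0.3, (200, 2))]
flow_u = np.c_[rng.uniform(50, 100, (200, 2)), rng.normal([0, 4], 0.3, (200, 2))]
mv = np.vstack([flow_r, flow_u])

# EM clustering of the motion-vector field (a GMM is fit with EM).
labels = GaussianMixture(n_components=3, random_state=0).fit_predict(mv)

def orientation_hist(vecs, bins=18):
    """Normalized histogram of motion-vector orientations."""
    ang = np.arctan2(vecs[:, 3], vecs[:, 2])
    h, _ = np.histogram(ang, bins=bins, range=(-np.pi, np.pi))
    return h / max(h.sum(), 1)

def bhattacharyya(p, q):
    # Bhattacharyya distance between two discrete distributions.
    return -np.log(max(np.sum(np.sqrt(p * q)), 1e-12))

# Candidate merge: clusters whose orientation histograms are close
# (small Bhattacharyya distance) belong to the same flow.
hists = [orientation_hist(mv[labels == k]) for k in range(3)]
d01 = bhattacharyya(hists[0], hists[1])
```

A small `d01` indicates the two clusters share a dominant motion direction and should be merged into one flow segment; thresholding these distances yields the final crowd-flow segmentation.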
Deep DA for Ordinal Regression of Pain Intensity Estimation Using Weakly-Labeled Videos
Automatic estimation of pain intensity from facial expressions in videos has
an immense potential in health care applications. However, domain adaptation
(DA) is needed to alleviate the problem of domain shifts that typically occur
between video data captured in source and target domains. Given the laborious
task of collecting and annotating videos, and the subjective bias due to
ambiguity among adjacent intensity levels, weakly-supervised learning (WSL) is
gaining attention in such applications. Yet, most state-of-the-art WSL models
are typically formulated as regression problems, and do not leverage the
ordinal relation between intensity levels, nor the temporal coherence of
multiple consecutive frames. This paper introduces a new deep learning model
for weakly-supervised DA with ordinal regression (WSDA-OR), where videos in the
target domain have coarse labels provided on a periodic basis. The WSDA-OR
model enforces ordinal relationships among the intensity levels assigned to
the target sequences, and associates multiple relevant frames to sequence-level
labels (instead of a single frame). In particular, it learns discriminant and
domain-invariant feature representations by integrating multiple instance
learning with deep adversarial DA, where soft Gaussian labels are used to
efficiently represent the weak ordinal sequence-level labels from the target
domain. The proposed approach was validated on the RECOLA video dataset as
fully-labeled source domain, and UNBC-McMaster video data as weakly-labeled
target domain. We have also validated WSDA-OR on BIOVID and Fatigue (private)
datasets for sequence-level estimation. Experimental results indicate that our
approach can provide a significant improvement over state-of-the-art models,
achieving greater localization accuracy.
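The soft Gaussian labels mentioned above can be illustrated with a short sketch: an ordinal sequence-level label is encoded as a Gaussian over all intensity levels, so adjacent levels receive partial probability mass instead of the hard one-hot assignment a standard classifier would use. The number of levels and the bandwidth below are illustrative assumptions.

```python
import numpy as np

def soft_gaussian_label(level, n_levels, sigma=1.0):
    """Encode an ordinal intensity level as a Gaussian distribution over
    all levels, preserving the ordinal relation between neighbours."""
    grid = np.arange(n_levels)
    w = np.exp(-0.5 * ((grid - level) / sigma) ** 2)
    return w / w.sum()

# Weak sequence-level label: pain intensity 3 on a 0..5 scale.
p = soft_gaussian_label(3, 6, sigma=1.0)
# Mass is centred on level 3 and decays over neighbouring levels, so the
# model is penalized less for confusing adjacent intensities than distant
# ones -- exactly the ambiguity structure of ordinal annotations.
```

Training against such soft targets (e.g., with a cross-entropy or KL objective) lets the model exploit the ordering of intensity levels, which a plain regression or one-hot formulation discards.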
Compressed domain human action recognition in H.264/AVC video streams
This paper discusses a novel high-speed approach for human action recognition in the H.264/AVC compressed domain. The proposed algorithm utilizes cues from quantization parameters and motion vectors extracted from the compressed video sequence for feature extraction and subsequent classification using Support Vector Machines (SVM). The ultimate goal of the proposed work is to provide a much faster algorithm than its pixel-domain counterparts, with comparable accuracy, utilizing only the sparse information available in the compressed video. Partial decoding rules out the complexity of full decoding and minimizes computational load and memory usage, which can result in reduced hardware utilization and faster recognition. The proposed approach can handle illumination changes and scale and appearance variations, and is robust in both outdoor and indoor testing scenarios. We have evaluated the performance of the proposed method on two benchmark action datasets and achieved more than 85% accuracy. The proposed algorithm classifies actions at speeds (> 2,000 fps) approximately 100 times faster than existing state-of-the-art pixel-domain algorithms.
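The feature-extraction-plus-SVM step described above can be sketched as follows. The descriptor here (an orientation histogram of motion vectors plus quantization-parameter statistics) and the synthetic two-action data are illustrative assumptions standing in for real compressed-domain cues, not the paper's exact features.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)

def mv_features(motion_vectors, qps):
    """Toy descriptor from compressed-domain cues: an orientation histogram
    of the motion vectors plus simple quantization-parameter statistics."""
    ang = np.arctan2(motion_vectors[:, 1], motion_vectors[:, 0])
    hist, _ = np.histogram(ang, bins=8, range=(-np.pi, np.pi), density=True)
    return np.r_[hist, qps.mean(), qps.std()]

# Synthetic stand-ins for two actions: horizontal vs. vertical motion.
def sample(direction):
    mv = rng.normal(direction, 0.5, (100, 2))      # (dx, dy) per macroblock
    qp = rng.integers(20, 40, 100)                 # quantization parameters
    return mv_features(mv, qp)

X = np.array([sample([3, 0]) for _ in range(20)] +
             [sample([0, 3]) for _ in range(20)])
y = np.array([0] * 20 + [1] * 20)

# SVM classification on the compressed-domain descriptors.
clf = SVC(kernel="rbf").fit(X, y)
pred = clf.predict(X)
```

Because the descriptor is computed from data already present in the bitstream, no pixel reconstruction is needed, which is what makes the compressed-domain pipeline so much faster than pixel-domain recognition.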
Reliability studies on Si PIN photodiodes under Co-60 gamma radiation
Silicon PIN photodiodes were fabricated with a 250 nm SiO2 antireflective coating (ARC). The changes in the electrical characteristics, capacitance-voltage characteristics, and spectral response after gamma irradiation are systematically studied to estimate the radiation tolerance up to 10 Mrad. The characteristics studied in this investigation demonstrate that Si PIN photodiodes are suitable for high-radiation environments.